[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744
[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744qinghui-xu wants to merge 2 commits intoapache:masterfrom
Conversation
Previously we are relying on random uuid suffix of `job` label to avoid metric collisions among taskmangers. This makes it difficult to aggregate over `job` label. Now we use the uuid as grouping key instead, while respecting prometheus guide line on `job` label usage.
|
@flinkbot run azure |
| */ | ||
| @PublicEvolving | ||
| public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter implements Scheduled { | ||
| public static final String REPORTER_ID_GROUPPING_KEY = |
There was a problem hiding this comment.
Minor spelling nit: I think we want this to be called `REPORTER_ID_GROUPING_KEY.
| .defaultValue("") | ||
| .withDescription("The job name under which metrics will be pushed"); | ||
|
|
||
| public static final ConfigOption<Boolean> RANDOM_JOB_NAME_SUFFIX = |
There was a problem hiding this comment.
Are we sure we want to remove this configuration entirely? I’d imagine there are existing jobs and/or users already relying on it. This would also require updating the associated documentation and, at a minimum, feels like something that should be deprecated before being removed.
There was a problem hiding this comment.
Good point. Let me deprecate it in the PR. I'm wondering should I also change the default value to false while keeping it. I'll use this flag as a switch: if job is not suffixed, a "flink_prometheus_push_gateway_reporter_id" grouping key will be used.
There was a problem hiding this comment.
Sent a followup commit in this sense.
|
|
||
| GroupingKeyMap(Map<String, String> customGroupingKeys) { | ||
| if (customGroupingKeys.containsKey(REPORTER_ID_GROUPPING_KEY)) { | ||
| throw new IllegalArgumentException( |
There was a problem hiding this comment.
We'll want to add a test here to cover the assertion against the user supplying "flink_prometheus_push_gateway_reporter_id" as an explicit grouping (asserting the expected exception).
- Keep `randomJobNameSuffix` config as deprecated - When `randomJobNameSuffix` is disabled (by default), group metrics using random reporter id
| * grouping keys, so that each taskmanger instance and jobmanager will not collide their metrics | ||
| * in the PushGateway. | ||
| */ | ||
| static class GroupingKeyMap extends AbstractMap<String, String> { |
There was a problem hiding this comment.
The custom GroupingKeyMap feels heavier than necessary for what is effectively appending one synthetic entry.
Since the goal is just to preserve user-defined grouping keys and add REPORTER_ID_GROUPING_KEY as the last entry, would a LinkedHashMap copy be simpler and easier to reason about?
For example:
Map<String, String> grouping = new LinkedHashMap<>(groupingKey);
grouping.put(REPORTER_ID_GROUPING_KEY, reporterId);
That avoids reimplementing AbstractMap, iterator behavior, and custom entrySet() semantics.
| "Grouping keys must not contain the reserved key: " | ||
| + REPORTER_ID_GROUPPING_KEY); | ||
| } | ||
| this.customGroupingKeys = customGroupingKeys; |
There was a problem hiding this comment.
Would it make sense to defensively copy customGroupingKeys here?
Using new LinkedHashMap<>(customGroupingKeys) would avoid accidental external mutation after construction and also guarantees stable iteration order.
| ConfigOptions.key("randomJobNameSuffix") | ||
| .booleanType() | ||
| .defaultValue(true) | ||
| .defaultValue(false) |
There was a problem hiding this comment.
Changing the default from true to false introduces a behavioral change for existing deployments.
Even though the option is deprecated, preserving the current default for one deprecation cycle may be safer, then switching defaults in a major compatibility window.
| .defaultValue(false) | ||
| .withDescription( | ||
| "Specifies whether a random suffix should be appended to the job name."); | ||
| "Specifies whether random suffixing `job` label (the old way) to avoid metric collision among reporters from taskamangers." |
There was a problem hiding this comment.
The option description could use a wording pass:
taskamangers->taskmanagersrandom suffixing job labelreads a bit awkwardly
Current wording is understandable, but polishing it would improve config readability.
What is the purpose of the change
Previously we are relying on random uuid suffix of
joblabel to avoid metric collisions among taskmangers. This makes it difficult to aggregate overjoblabel.Now we use the uuid as grouping key (
flink_prometheus_push_gateway_reporter_id) instead, while respecting prometheus guide line onjoblabel usage.Brief change log
randomJobNameSuffixflag from the config optionsjoblabel will not be suffixedVerifying this change
Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)Documentation